Choosing the reference level wisely
For each categorical variable in a multiple regression model, the program considers one of the
categories to be the reference level and evaluates how each of the other levels affects the outcome,
relative to that reference level. Statistical software lets you specify the reference level for a
categorical variable, but you can also let the software choose it for you. The problem is that the
software uses some arbitrary algorithm to make that choice (such as whatever level sorts
alphabetically as first), and usually chooses one you don’t want. Therefore, it is better if you instruct
the software on the reference level to use for all categorical variables. For specific advice on
choosing an appropriate reference level, read the next section, “Recoding categorical variables as
numerical.”
Recoding categorical variables as numerical
Data may be stored as character variables — meaning the variable for primary diagnosis (PrimaryDx)
may be contain character data, such as Hypertension, Diabetes, Cancer, and Other. Because it is
difficult for statistical programs to work with character data, these variables are usually recoded with
a numerical code before being used in a regression. This means a new variable is created, and is
coded as 1 for hypertension, 2 for diabetes, 3 for cancer, and so on.
It is best to code binary variables as 0 for not having the attribute or state, and 1 for having the
attribute or state. So a binary variable named Cancer should be coded as Cancer = 1 if the
participant has cancer, and Cancer = 0 if they do not.
For categorical variables with more than two levels, it’s more complicated. Even if you recode the
categorical variable from containing characters to a numeric code, this code cannot be used in
regression unless we want to model the category as an ordinal variable. Imagine a variable coded as 1
= graduated high school, 2 = graduated college, and 3 = obtained post-graduate degree. If this variable
was entered as a predictor in regression, it assumes equal steps going from code 1 to code 2, and from
code 2 to code 3. Anyone who has applied to college or gone to graduate school knows these steps are
not equal! To solve this problem, you could select one level for the reference group (let’s choose 3),
and then create two binary indicator variables for the other two levels — meaning one for 1 =
graduated high school and 2 = graduated college. Here’s another example of coding multilevel
categorical variables as a set of indicator variables, where each level is assigned its own binary
variable that is coded 1 if the level applies to the row, and 0 if it does not (see Table 17-1).
TABLE 17-1 Coding a Multilevel Category into a Set of Binary Indicator
Variables
StudyID PrimaryDx
HTN Diab Cancer OtherDx
1
Hypertension 1
0
0
0
2
Diabetes
0
1
0
0
3
Cancer
0
0
1
0
4
Other
0
0
0
1
5
Diabetes
0
1
0
0